Significant Phrases Detection

Authors

  • Mikhail Bautin
  • Michael Hart
Abstract

The problem of determining the key words and phrases that best characterize a text document has important applications, such as building a compact index for a large-scale text processing system or using a keyword set for summarization and topic detection. We approached this problem from two perspectives. Our knowledge-poor approach is based on statistical collocation detection using the t-test and likelihood ratio, combined with latent semantic analysis to identify terms that are important in a particular document. The knowledge-rich approach addresses the problem using noun phrase chunking and coreference resolution. Both approaches use a decision tree classifier to decide, from a set of calculated features, whether a given phrase is a key word or phrase. We have built prototypes and compared the results of the two approaches.
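
The abstract does not spell out how the t-test is applied, so the following is only a minimal sketch of the common textbook formulation of t-test collocation scoring for adjacent word pairs, not the authors' implementation. The function name bigram_t_scores and the toy sentence are illustrative assumptions. The score compares the observed bigram frequency with the frequency expected if the two words occurred independently.

```python
import math
from collections import Counter


def bigram_t_scores(tokens):
    """Return {(w1, w2): t} for every adjacent word pair in tokens.

    Assumed formulation (not taken from the paper):
        t = (observed - expected) / sqrt(variance / n)
    where observed = count(w1 w2) / n,
          expected = count(w1) * count(w2) / n^2  (independence assumption),
          variance = observed * (1 - observed)    (Bernoulli variance).
    """
    n = len(tokens)
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    scores = {}
    for (w1, w2), c12 in bigrams.items():
        observed = c12 / n
        expected = unigrams[w1] * unigrams[w2] / (n * n)
        variance = observed * (1 - observed)
        scores[(w1, w2)] = (observed - expected) / math.sqrt(variance / n)
    return scores


# Toy input, purely illustrative.
tokens = "new york is big and new york is old but the city is new".split()
scores = bigram_t_scores(tokens)
for pair, t in sorted(scores.items(), key=lambda kv: -kv[1])[:3]:
    print(pair, round(t, 2))
```

A higher score suggests the pair co-occurs more often than chance would predict. In a pipeline like the one the abstract describes, such a score could serve as one of the features passed to the classifier, although which features the authors actually used is not stated here.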

Similar resources

Determining the Boundaries and Types of Syntactic Phrases in Persian Texts

Text tokenization is the process of splitting text into meaningful tokens such as words, phrases, and sentences. Tokenization into syntactic phrases, known as chunking, is an important preprocessing step needed in many applications such as machine translation, information retrieval, and text-to-speech. In this paper chunking of Farsi texts is done using statistical and learning methods and the grammat...

Identifying Evolutionary Topic Temporal Patterns Based on Bursty Phrase Clustering

We discuss a temporal text mining task on finding evolutionary patterns of topics from a collection of article revisions. To reveal the evolution of topics, we propose a novel method for finding key phrases that are bursty and significant in terms of revision histories. Then we show a time series clustering method to group phrases that have similar burst histories, where additions and deletions...

Towards Sentiment Analysis of Financial Texts in Croatian

The paper presents results of an experiment dealing with sentiment analysis of Croatian text from the domain of finance. The goal of the experiment was to design a system model for automatic detection of general sentiment and polarity phrases in these texts. We have assembled a document collection from web sources writing on the financial market in Croatia and manually annotated articles from a...

A Unified Probabilistic Approach for Semantic Clustering of Relational Phrases

The task of finding synonymous relational phrases is important in natural language understanding problems such as question answering and paraphrase detection. While this task has been addressed by many previous systems, each of these existing approaches is limited either in expressivity or in scalability. To address this challenge, we present a large-scale statistical relational method for clus...

Paraphrase Detection Using Recursive Autoencoder

In this paper, we tackle the paraphrase detection task. We present a novel recursive autoencoder architecture that learns representations of phrases in an unsupervised way. Using these representations, we are able to extract features for classification algorithms that allow us to outperform many results from previous works.

Publication date: 2006